Abstract

We propose a method for detecting differential gene expression that exploits the
correlation between genes. Our proposal averages the univariate scores of each feature
with the scores in correlation neighborhoods. In a number of real and simulated
examples, the new method often exhibits lower false discovery rates than simple
t-statistic thresholding. We also provide some analysis of the asymptotic behavior of
our proposal. The general idea of correlation-sharing can be applied to other prediction
problems involving a large number of correlated features. We give an example in
protein mass spectrometry.
1 Introduction

We consider methods for detecting differentially expressed genes from a set of microarray
experiments. Consider the simple case of m genes measured across two experimental condi- 
tions. A number of authors have proposed methods for detecting differential gene expression, 
including ?, ? and ?. ? presents an interesting, more general approach. 

One widely used approach to this problem is as follows. We compute a two-sample t- 
statistic Tj for each gene, and then call a gene significant if |Tj| exceeds some threshold c. 
Various values of c are tried, using permutations of the sample labels to estimate the false 
discovery rate (FDR) for the procedure for each c. A threshold c is finally chosen based on 
the estimates of FDR and other considerations, such as the ballpark number of significant 
genes that is desirable. This recipe roughly describes the strategy used, for example, in the
Significance Analysis of Microarrays (SAM) procedure (?).
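
The thresholding recipe above can be sketched in a few lines. This is a simplified illustration, not the SAM procedure itself: the function names `t_stats` and `permutation_fdr` are hypothetical, the statistic is the ordinary pooled-variance t-statistic, and the FDR estimate is simply the average number of genes called under permuted labels divided by the number called with the observed labels.

```python
import numpy as np

def t_stats(X, y):
    """Pooled-variance two-sample t-statistic for each row of X; y holds labels 1 and 2."""
    a, b = X[:, y == 1], X[:, y == 2]
    n1, n2 = a.shape[1], b.shape[1]
    pooled_var = ((n1 - 1) * a.var(axis=1, ddof=1)
                  + (n2 - 1) * b.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (b.mean(axis=1) - a.mean(axis=1)) / se

def permutation_fdr(X, y, c, n_perm=100, seed=0):
    """Return (estimated FDR, number of genes called) for the rule |T_i| > c."""
    rng = np.random.default_rng(seed)
    n_called = int(np.sum(np.abs(t_stats(X, y)) > c))
    if n_called == 0:
        return 0.0, 0
    # Under permuted labels every call is a false positive by construction.
    null_calls = [np.sum(np.abs(t_stats(X, rng.permutation(y))) > c)
                  for _ in range(n_perm)]
    return min(1.0, float(np.mean(null_calls)) / n_called), n_called
```

In practice one would evaluate `permutation_fdr` over a grid of thresholds c and pick the threshold from the resulting curve.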

In this paper we propose a simple method for potentially improving on the thresholded 
t-statistic approach defined above. The idea is to exploit correlation among the genes. In a 
sense this general idea is not new, and exploratory methods based on clustering have been 
proposed (e.g. ?). These methods require choices like the clustering metric and linkage, and 
hence are somewhat subjective. The proposal presented here is much simpler, and hence it 
is easier to analyze and assess its performance. 

We start with t-statistics $T_i$ computed for each gene. Then we assign to each gene a score
$r_i$ equal to the average of all t-statistics for genes having correlation at least $\rho(i)$ with that
gene, choosing the best value of $\rho(i) \in [0, 1]$ to maximize the average. Finally, we call a gene
significant if $|r_i|$ exceeds some threshold c. The idea is that differentially expressed genes are
likely to co-exist in a pathway, and hence will be correlated in our data. Hence use of the
score $r_i$ might provide a more accurate test of significance than that based on $T_i$. We call
this approach "correlation sharing". Note that the choice $\rho(i) = 1$ yields no sharing, giving
$r_i = T_i$. Hence the correlation-sharing method contains the thresholded t-statistic approach
as a special case.

As a motivating example, we generated data with 1000 genes and 30 samples. The first
50 genes $i \in \mathcal{P} = \{1, 2, \ldots, 50\}$ are generated as

$$ x_{ij} = z_{ij} + 0.75 \cdot I(j > 15) \tag{1} $$

with $z_{ij} \sim N(0, 1)$ and $\mathrm{corr}(z_i, z_{i'}) = 0.8$, where $z_i = (z_{i1}, \ldots, z_{in})$. The remaining genes
were generated as $N(0, 1)$. The outcome variable $Y_j$ equaled 2 for $16 \le j \le 30$
and 1 otherwise.

Figure 1 shows the t-statistics (top panel) and correlation-shared t-statistics (bottom
panel). We see that in the bottom panel the scores for the first 50 genes are magnified. This
leads to improved detection of the differentially expressed genes, as we show in the next
section.

The outline of this paper is as follows. Section 2 defines correlation-sharing. In section
3 we discuss the concept of residual correlation, and its impact on correlation-sharing. We
apply our method to four microarray cancer datasets. The skin data is examined more closely
in section 4. Some asymptotic results for correlation sharing are given in section 6. Section
5 applies the method to a different kind of data, protein mass spectra. Finally in section
7 we discuss the application of correlation sharing to other kinds of response variables, and
computational issues.
2 Correlation sharing

Let X be the m x n matrix of expression values, for m genes and n samples. We assume
that the samples fall into two groups j = 1 and 2. We start with the standard (unpaired)
t-statistic

$$ T_i = \frac{\bar{x}_{i2} - \bar{x}_{i1}}{s_i} \tag{2} $$

Here $\bar{x}_{ij}$ is the mean of gene i in group j and $s_i$ is the pooled within-group standard deviation
of gene i.

Let $x_i$ denote the ith row of X. Define $C_\rho(i) = \{k : \mathrm{corr}(x_i, x_k) \ge \rho\}$, the indices of the
genes with correlation at least $\rho$ with gene i. Then we define

$$ u_i = \max_{0 \le \rho \le 1} \operatorname{ave}_{j \in C_\rho(i)} |T_j| $$

$$ r_i = \mathrm{sign}(T_i) \cdot u_i \tag{3} $$

We call this the "correlation-shared" t-statistic. The method calls significant all genes
having $|r_i| > c$, and estimates the false discovery rate (FDR) of the resultant gene list by
permutations. We vary c and examine the estimated FDR.
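
A minimal NumPy sketch of the correlation-shared score follows. It is an illustration under stated assumptions rather than the implementation used for the results in this paper: the function name `correlation_shared` is hypothetical, the full m x m correlation matrix is formed in memory, and the maximization over $\rho$ is done by scanning each gene's neighbors in order of decreasing correlation and tracking the running mean of the absolute scores.

```python
import numpy as np

def correlation_shared(X, T):
    """Correlation-shared scores for an m x n data matrix X and m univariate scores T.

    Returns (r, u, rho_hat, size): r[i] = sign(T[i]) * u[i], where u[i] is the largest
    average of |T[j]| over neighborhoods {j : corr(x_i, x_j) >= rho}, 0 <= rho <= 1.
    """
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    R = np.clip(np.corrcoef(X), -1.0, 1.0)    # m x m gene-gene correlations
    np.fill_diagonal(R, 1.0)                   # each gene is its own closest neighbor
    abs_T = np.abs(T)
    m = T.shape[0]
    u = np.empty(m)
    rho_hat = np.empty(m)
    size = np.empty(m, dtype=int)
    for i in range(m):
        order = np.argsort(-R[i], kind="stable")   # most correlated first
        order = order[R[i, order] >= 0.0]           # only rho in [0, 1]
        vals = R[i, order]
        running = np.cumsum(abs_T[order]) / np.arange(1, order.size + 1)
        # A neighborhood must contain all genes tied at its cutoff correlation,
        # so only the last position of each run of ties is a valid cutoff.
        valid = np.append(vals[1:] < vals[:-1], True)
        k = np.flatnonzero(valid)[np.argmax(running[valid])]
        u[i], rho_hat[i], size[i] = running[k], vals[k], k + 1
    return np.sign(T) * u, u, rho_hat, size
```

Because the neighborhood of size one is always a candidate, `u[i]` is never smaller than `abs(T[i])`, which matches the remark that $\rho(i) = 1$ recovers the plain t-statistic.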

Figure 2 shows the results for correlation sharing applied to the simulated data from
model (1). As the threshold is varied, the number of genes called significant and the number
of false positive genes and false negative genes all change. We see that correlation sharing
generally yields fewer false positive and false negative genes than the t-statistic.

We can also think of correlation-sharing as a method for supervised clustering. Let $\hat\rho(i)$ be
the maximizing correlation for gene i, from definition (3). Then the set of genes with indices
$C_{\hat\rho(i)}(i)$ is an adaptively chosen cluster, selected to maximize the average "signal" around
gene i. Unlike with most standard clustering methods, the clusters $C_{\hat\rho(i)}(i)$ are overlapping,
rather than mutually disjoint. We examine these clusters in some examples later in this
paper.

As a second example, we changed the data generation so that the first 50 genes had
no correlation, before the group effect was added. Figure 3 shows that the advantage of
correlation sharing has disappeared.

3 Residual correlation among non-null and null genes 

The previous example suggests that a key assumption for our proposal is that the
correlation between the non-null genes is higher than that for the null genes.

We need to say precisely what we mean by "correlation". Suppose for a set of non-null
genes the expression is $\beta$ units higher in group $Y_j = 2$ than it is in group $Y_j = 1$:

$$ x_{ij} = \beta \cdot I(Y_j = 2) + \epsilon_{ij} \quad \text{for } i \in \mathcal{P}; \qquad x_{ij} = \epsilon_{ij} \quad \text{for } i \notin \mathcal{P} \tag{4} $$

Let $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$. Then even if the errors $\epsilon_{ij}$ are all independent of one another,
we have $\mathrm{corr}(x_i, x_{i'}) > 0$ for $i, i' \in \mathcal{P}$. That is, the treatment effect induces an overall
correlation between the genes in $\mathcal{P}$. However we would expect that the t-statistic would
capture all of the information needed to decide if a gene is in $\mathcal{P}$.

Instead, we assume that there is residual correlation among the genes in $\mathcal{P}$:

$$ \mathrm{corr}(\epsilon_i, \epsilon_{i'}) > 0 \quad \text{for } i, i' \in \mathcal{P} \tag{5} $$

where $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{in})$.

For the simulated data of Figure 1 the estimated residual correlation is the correlation
between genes, after having removed the estimated effect of treatment. Specifically, the
residual correlation is $\mathrm{corr}(x_i^*, x_{i'}^*)$ where $x_{ij}^* = x_{ij} - \hat{x}_{ij}$. For the two sample case, for
example, $\hat{x}_{ij} = \bar{x}_{i2} - \bar{x}_{i1}$, $\bar{x}_{ik}$ equaling the average of $x_{ij}$ for samples in group k.

The average absolute residual correlation for the non-null genes (the first 50 genes) 
equaled 0.47, while that for the null genes was 0.15, and the correlation between the non-null 
and null genes was also 0.15. 
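
The residual-correlation summaries quoted above can be computed along the following lines. This is a sketch under an explicit assumption: the treatment effect is removed by subtracting, for each gene, the mean of the group that each sample belongs to. The helper names `residual_correlation` and `mean_abs_corr` are hypothetical.

```python
import numpy as np

def residual_correlation(X, y):
    """Gene-gene correlation matrix after subtracting each gene's own group means."""
    resid = np.asarray(X, dtype=float).copy()
    for g in np.unique(y):
        cols = (y == g)
        # Assumption: the fitted value for a sample is the gene's mean in that sample's group.
        resid[:, cols] -= resid[:, cols].mean(axis=1, keepdims=True)
    return np.corrcoef(resid)

def mean_abs_corr(R, rows, cols):
    """Average |R[i, j]| over i in rows and j in cols, skipping pairs with i == j."""
    rows, cols = np.asarray(rows), np.asarray(cols)
    block = np.abs(R[np.ix_(rows, cols)])
    off_diagonal = rows[:, None] != cols[None, :]
    return block[off_diagonal].mean()
```

With `non_null` and `null` index arrays, the three summaries are `mean_abs_corr(R, non_null, non_null)`, `mean_abs_corr(R, null, null)` and `mean_abs_corr(R, non_null, null)`.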

Is there residual correlation in real microarray data? Biologically, genes will be correlated 
if they are in the same pathway. However if that pathway is not active in the experimental 
conditions under study, the genes in the pathway will not show large correlation. And the 
same genes will tend to be null, i.e. will not be differentially expressed in the experiment. The
opposite should be true for differentially expressed genes.

To see if this assumption is reasonable in practice, we examine four microarray datasets:
the skin data taken from ?, the Duke breast cancer data taken from (?), the BRCA data
taken from ? and the non-Hodgkins lymphoma data from ?. These are summarized in Table 1.

The false discovery rates of both the t-statistic and correlation-shared statistics depend 
on the total number of genes input into the corresponding procedure. Hence for fairness 
(and computational speed) we started with the 2000 genes having largest overall variance in 
each case. 

To examine residual correlation, we computed the two-sample t-statistics for each gene.
Then we computed the average absolute residual correlation for genes satisfying $|T_i| > c$,
with c varying from the 99th to the 75th quantiles of the $|T_i|$ values. In the lymphoma data
the outcome is survival time; hence we instead computed Cox's partial likelihood score
statistic for each gene (see section 7).

The results are shown in Figure 4. For the skin and lymphoma datasets, the non-null
genes have higher correlation with each other than they have with the null genes, and
also higher than that within the null genes. But for the Duke and BRCA2 datasets, this is
not the case.

For the same four datasets, Figure 5 shows the estimated number of false positive genes
plotted against the number of genes called significant, for both the t-statistic and
correlation-shared t-statistic. Correlation sharing exhibits lower FDR for all datasets except the Duke
data, where neither method does much at all.

4 Skin data example 

We examine more closely the results for the skin data shown in the top left panel of Figure 5.
There are 12,625 genes and 58 patients: 44 normal patients and 14 with radiation sensitivity.

Figure 6 illustrates how correlation sharing can magnify the effect of a gene (#1127
chosen as an example). The figure shows all genes having correlation at least 0.5 with gene
#1127. Its raw t-statistic is about 2.0. Notice that the genes most correlated with gene
#1127 have greater scores than this gene. In particular, gene #1127 has correlation > 0.6
with a gene having score about 4.7. Hence our procedure averages the scores of these two
genes to produce a new score of about 3.8.

Figure 8 shows the correlation-shared score versus the t-statistic score. Setting the cutoffs
so that each method yields 100 significant genes, there are 13 genes which are called by each
method and not called by the other. The red points represent the genes that are called
significant by correlation-sharing but not by the t-statistic. Many of these genes are highly
correlated with each other, and hence they boost up each other's score.

In Figure 9 we do another test of our procedure. We randomly divided the samples
into equal-sized training and test sets. We computed the t-statistic and correlation sharing
statistics on the training set, and also evaluated on the test set. For each trial cutpoint
applied to the training set scores, we counted the number of genes with scores above or
below this cutpoint in the test set. Genes above the cutpoint in the training set but below
it in the test set were considered "false positives", and conversely for false negatives. The
results in Figure 9 show that correlation sharing has fewer false negatives for the same
number of false positives.
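
The counting rule for this split-sample check can be written down directly. The sketch below assumes that scores are compared in absolute value against a common cutpoint c; the function name `split_half_errors` is hypothetical.

```python
import numpy as np

def split_half_errors(train_scores, test_scores, c):
    """Count genes whose call at cutpoint c differs between training and test scores."""
    train_hit = np.abs(np.asarray(train_scores, dtype=float)) > c
    test_hit = np.abs(np.asarray(test_scores, dtype=float)) > c
    false_pos = int(np.sum(train_hit & ~test_hit))   # called in training, not in test
    false_neg = int(np.sum(~train_hit & test_hit))   # missed in training, called in test
    return false_pos, false_neg
```

Sweeping c over a grid and plotting the two counts against each other gives a curve of the kind described for each scoring method.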

5 Example: protein mass spectrometry 

This example (taken from (?)) consists of the intensities of 3160 peaks on 20 patients: 10 
healthy patients and 10 with Kawasaki's disease. They were measured on a SELDI protein 
mass spectrometer. 

Figure 10 shows that correlation sharing offers a mild improvement in the false positive
rate.

For the 50 peaks having the top scores, 19 of these peaks were given neighborhoods of
more than a single feature by the correlation sharing procedure. The smallest correlation
chosen for neighborhood averaging was 0.7. Now in this example, each peak has an associated
m/z (mass over charge) location: this was not used in the correlation-sharing procedure, but
we can look post hoc at these values within each averaging neighborhood. Figure 11 shows
the location of each of the 19 peaks (horizontal axis) and the chosen neighbors (vertical
axis). The corresponding neighborhood correlation is indicated along the top of the plot.
We see that most often, the selected neighbors are close to the target peak. But in some
cases, they can be very far apart. Some biological insights might emerge from examination
of these groups of peaks.



6 Asymptotic Analysis 

In this section we show that, under appropriate conditions, correlation sharing improves
power. More specifically, we show that for null genes, $u_i$ has similar behavior to $T_i$, while
for nonnull genes, $u_i$ tends to be stochastically larger than $T_i$. For simplicity, we focus on
a one-sample, one-sided test. We denote by $X_{ik}$ the measurement for gene i in sample k.
Let $T_i = n^{-1} \sum_{k=1}^{n} X_{ik}$ denote the test statistic for gene i and assume that $X_{ik} \sim N(\beta_i, \sigma^2)$
where $\beta_i = 0$ for null genes and $\beta_i > 0$ for non-nulls. Let $\rho(i,j) = \mathrm{corr}(X_{ik}, X_{jk})$ denote
the true residual correlation between gene i and gene j, and $\hat\rho(i,j)$ denote the estimated residual
correlation.

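The correlation-shared statistic is

$$ u_i = \max_{0 \le \rho \le 1} \frac{1}{|C_\rho(i)|} \sum_{j \in C_\rho(i)} T_j \tag{6} $$

where

$$ C_\rho(i) = \{ j : \hat\rho(i,j) \ge \rho \}. \tag{7} $$

Throughout this section we make a small modification to the statistic which simplifies the
analysis: we restrict the maximization in the definition of $u_i$ to be over correlation
neighborhoods no larger than K, where K is some fixed integer.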

Recall that there are m genes and n observations. We require both m and n to grow
in the asymptotic analysis. Typically, m is much larger than n so, to keep the asymptotics
realistic, we allow n to grow very slowly relative to m. Specifically, we assume:

Assumption (A1): $n = n(m) \ge C \log m$ for some sufficiently large $C > 0$. (8)

Let $\mathcal{P}$ denote the nonnull genes and let $\mathcal{N} = \mathcal{P}^c$ denote the null genes. We will also need
the following:

Assumption (A2): There exists $0 < \delta < 1$ such that

$$ 0 = \max_{i \ne j \in \mathcal{N}} \rho(i,j) = \max_{i \in \mathcal{N},\, j \in \mathcal{P}} \rho(i,j) < \delta = \min_{i, j \in \mathcal{P}} \rho(i,j). \tag{9} $$

Thus we make the strong assumption that there is positive residual correlation among the
non-null genes, but no residual correlation among the null genes or between the null and
non-null genes. This simplifies our analysis. Later, we will relax this assumption.

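LEMMA 1. Assume that (A1) holds. Fix $\epsilon > 0$. Then, for all large m,

$$ \max_{i,j} |\hat\rho(i,j) - \rho(i,j)| < \epsilon \quad \text{a.s.} \tag{10} $$

and

$$ \max_i |T_i - \beta_i| < \epsilon \quad \text{a.s.} \tag{11} $$

That is, $\hat\rho(i,j) = \rho(i,j) + o(1)$, uniformly over i, j, a.s. and $T_i = \beta_i + o(1)$, uniformly over i,
a.s.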

PROOF of Lemma 1. Kalisch and Bühlmann (2005) show that

$$ P(|\hat\rho(i,j) - \rho(i,j)| > \epsilon) \le c_1 (n-1) \exp\{-(n-3) \log((4+\epsilon^2)/(4-\epsilon^2))\} \tag{12} $$

for some $c_1 > 0$. So,

¥{me.x\p{ij) - p{ij)\ > e) (13) 
< 2m^ci(n- 1) X exp |-(n - 3) log((4 + e2)/(4 - e^))} < ^ (14) 
where

^ = 1 log((4 + e^)/(4 - e-)) - ^o^i^c,) - ^o^C - loglogm ^^^^ 
2 logm 

For C sufficiently large, $\beta > 1$. The first result then follows from the Borel-Cantelli Lemma.
For the second result, apply Mill's inequality:

$$ P(\max_i |T_i - \beta_i| > \epsilon) \le m e^{-n\epsilon^2/(2\sigma^2)}. \tag{16} $$

The result follows from assumption (A1) and the Borel-Cantelli Lemma. ■
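
The last step can be spelled out as follows. This is a worked sketch that reads the bound in (16) as $m\,e^{-n\epsilon^2/(2\sigma^2)}$ and uses only assumption (A1); the explicit condition on C is derived here and is not stated in the text.

```latex
% Reading (16) as  P(max_i |T_i - beta_i| > eps) <= m exp(-n eps^2 / (2 sigma^2))
% and using (A1), n >= C log m:
\begin{align*}
m\,e^{-n\epsilon^2/(2\sigma^2)}
  &\le m \exp\!\Big(-\frac{C\epsilon^2}{2\sigma^2}\,\log m\Big)
   = m^{\,1 - C\epsilon^2/(2\sigma^2)} .
\end{align*}
% The right-hand side is summable over m whenever C eps^2/(2 sigma^2) - 1 > 1,
% i.e. C > 4 sigma^2 / eps^2, so by the Borel-Cantelli Lemma the event
% {max_i |T_i - beta_i| > eps} occurs for only finitely many m, almost surely.
```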



LEMMA 2. Assume (A1) and (A2). Then, for all $\rho > \delta$ and each $i \in \mathcal{P}$,

$$ C_\rho(i) \cap \mathcal{N} = \emptyset \quad \text{a.s.} \tag{17} $$

for all large m. Also, for every $i \in \mathcal{N}$,

$$ C_\rho(i) \cap \mathcal{P} = \emptyset \quad \text{a.s.} \tag{18} $$

for all $\rho > 0$. Thus, there are no nulls in the correlation neighborhoods $C_\rho(i)$ of a
non-null gene, except possibly for small $\rho$. Similarly, there are no nonnulls in the correlation
neighborhoods $C_\rho(i)$ of a null gene.



6.1 The Oracle Statistic 

To understand the behavior of the correlation sharing statistic, it is helpful to first consider
an oracle version of the statistic based on the true correlations. Let

$$ \kappa_i = \max_{\rho} \frac{1}{|V_\rho(i)|} \sum_{k \in V_\rho(i)} T_k, \tag{19} $$

where

$$ V_\rho(i) = \{ j : \rho(i,j) \ge \rho \}. \tag{20} $$
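Let us fix some nonnull gene $i \in \mathcal{P}$ and, without loss of generality, take i = 1. Without
loss of generality, label the genes so that

$$ \rho(1,2) \ge \rho(1,3) \ge \cdots \ge \rho(1,m). \tag{21} $$

Then,

$$ \kappa_1 = \max_r \frac{1}{r} \sum_{i=1}^{r} T_i = \max_r \frac{1}{r} \sum_{i=1}^{r} \frac{1}{n} \sum_{j=1}^{n} X_{ij} \tag{22} $$

$$ = \max_r \frac{1}{r} \sum_{i=1}^{r} \frac{1}{n} \sum_{j=1}^{n} (\beta_i + \epsilon_{ij}) = \max_r \Big( \bar\beta(r) + \frac{1}{r} \sum_{i=1}^{r} \frac{1}{n} \sum_{j=1}^{n} \epsilon_{ij} \Big) \tag{23} $$

$$ = \max_r \big( \bar\beta(r) + Z(r) \big) \tag{24} $$

where

$$ \bar\beta(r) = \frac{1}{r} \sum_{i=1}^{r} \beta_i \tag{25} $$

is the Cesàro average and $Z(\cdot)$ is a mean zero Gaussian process with covariance kernel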

J(r,5) = — ^^p(i,^). (26) 



j=l A;=l 

The distribution of k\ is thus the distribution of the maximum of a noncentered, nonstation- 
ary Gaussian process. 

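If $\bar\beta(r)$ is strongly peaked around some value $r_*$, then

$$ \kappa_1 \approx \bar\beta(r_*) + Z(r_*) \equiv V \sim N(\bar\beta(r_*), J(r_*, r_*)). \tag{27} $$

Hence,

$$ P(\kappa_1 > t) \approx P(V > t). \tag{28} $$

In particular, suppose that $\rho(1, i) = \rho$ for $i \in \mathcal{P}$ and $\rho(1, i) = 0$ for $i \in \mathcal{N}$. Then,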

V.~^(/3(r.),l^) (29) 
and so



> t) > P [x'M > (30) 
where Xi(^) is a noncentral Xi with noncentrahty parameter 

vr=!^^:^. (31) 
l + 2p ^ ^ 

In contrast, $T_1$ has noncentrality parameter $n\beta_1^2$. These heuristics imply that correlation
sharing improves the power if



' r \ 

> Pi, where r* = argmax^/3(r). (32) 



r^p (r,) 2 



l + 2p 

Figures 12 and 13 illustrate this analysis. The top plot in each figure is $\bar\beta(r)$ and the
bottom plot is the noncentrality as a function of the size r of the correlation neighborhood.

Figure 12 shows a least favorable case in which $\beta_1 = 10$ and $\beta_i = 1$ for $i > 1$, $i \in \mathcal{P}$.
(In all cases we took $\rho = .5$.) We call this least favorable since $T_1$ has the largest mean;
any averaging can only reduce its mean. Now, $r_* = 1$ and $T_1$ has noncentrality 50. The
randomness of $\hat\rho$ can lead to a correlation neighborhood larger than $r_* = 1$. If so, the
noncentrality parameter can be reduced as is evident from the steep decline of the curve in
the second plot.

Figure 13 shows a more realistic case. Here we used a random effects model and took
$\beta_i \sim N(3, 1)$. This makes $r\bar\beta(r)$ a random walk. Correspondingly, $\bar\beta(r)$ behaves like a
random walk for small r but settles down to a constant for large r. In this case, $r_*$ tends to
be small but the noncentrality grows rapidly. The result is a dramatic gain in noncentrality.
Also, the gain is robust to the choice of r.

Now consider a null gene $i \in \mathcal{N}$. Again take i = 1. Then, by assumption (A2), $\rho(1, j) = 0$
for all $j > 1$. Hence, $V_\rho(1) = \{1\}$ for all $\rho > 0$ and $\kappa_1 = T_1$ so the null distribution is
unaffected by correlation sharing.
Let us now consider weakening (A2). Suppose we allow some small, nonzero correlation
$\Delta$ among null genes. Change the definition of $u_i$ to



= rnax ^ (33) 

\Cp(i)\<K ' ''^ ^' jeCpii) 

Now replace (A2) with: 



Assumption (A2'): 



and 



minp(i, j) > maxp(i, j) (34) 
jev jev 



maxp(i,j) < A. (35) 

ieJ\f 

The analysis for nonnull genes is virtually unchanged. For null genes, condition (A2')
ensures that $\kappa_i = T_i$. An interesting extension is to estimate $\Delta$ from the data. We leave
this to future work.

6.2 Relationship Between $u_i$ and the Oracle

The analysis in the previous section ignores the variability of the $\hat\rho(i,j)$'s. Now we relate $\kappa_i$
to $u_i$.

First, under appropriate assumptions, we will show that for nonnull genes, $u_i$ is at least
as large as $\kappa_i$. Suppose there exists a decreasing function $f : [0, 1] \to [0, 1]$ with $f(0) = 1$,
such that

$$ \rho(1, i) = f(i/m). \tag{36} $$

Suppose that f is a simple function, that is, f takes finitely many values $a_1 > a_2 > \cdots > a_k$.
The level sets $V_\rho(1) = \{j : \rho(1, j) \ge \rho\}$ can only be of the form $A_s = \{j : \rho(1, j) \ge a_s\}$
for $s = 1, \ldots, k$. Choose $\epsilon > 0$ small. By Lemma 1, $\max_j |\hat\rho(1, j) - \rho(1, j)| < \epsilon$ a.s. Let
$I = \{\rho \in [0, 1] : \min_s |\rho - a_s| > \epsilon\}$. For all $\rho \in I$, $V_\rho(1) = C_\rho(1)$ a.s. Then, for all large m,

= max 1 — ^— - Tj a.s. — max -; — T!,- (38) 

= (39) 

so that $u_1$ is at least as large as $\kappa_1$.

Now we drop the assumption that f is simple and instead assume it is continuous and
strictly decreasing. Similarly, assume there exists a continuous, integrable function g such
that

$$ \beta_i = g(i/m). \tag{40} $$

Suppose that $\bar{g}(s) = s^{-1} \int_0^s g(u)\,du$ is maximized at some $s_* > 0$. Let $c = f(s_*)$ and
$r = |V_c(1)|$. Then, a.s. for all large m,

= max^ ^ T, = max^ ^ + o(l) (41) 



Hence, 



1 1 /''^ 

= max- + o(l) = max- / g(u)du-\- oil) (42) 
r r ^ s s Jo 

= - / ^(i^)dii + o(l)=^(s,)+o(l). (43) 



'c(l)l ' 



Let $R = |C_c(1)|$. From Lemma 1 and the assumptions on f, $R/m = r/m + o(1)$ a.s. and
^ E (46)



R 



R ^ ' R ^ ' R 

Pi)-ti)~^<^ P{l,!)>c p(l>i)<c 

p(l,i)<c p{l,i)>c 



p(l,i)>c 

= (49) 



Thus, $u_1 \ge \kappa_1 + o(1)$.

Now suppose that i = 1 is a null gene. Fix a small $\epsilon > 0$. Under (A2), we eventually
have

$$ |\{ j > 1 : \hat\rho(1, j) > \epsilon \}| = 0 \tag{50} $$

and hence

$$ u_1 = \kappa_1 \quad \text{a.s.} \tag{51} $$

The same holds under (A2').

7 Other issues 

Computation of the correlation shared statistic can be challenging when the number of
features m is large. Brute force computation is $O(m^2)$. In principle, a KD tree can be used
to quickly find the neighbors of a given point with correlation at least $\rho$. The building of
the tree requires O(m log m) computations, while the nearest neighbor search takes O(log m)
computations. Hence the nearest neighbor search for all points requires O(m log m)
computations. However, since the dimension of the feature space (n) is large in these problems (at
least 50 or 100), the KD tree approach is not likely to be effective in practice (J. Friedman,
personal communication).

Hence we instead do a direct brute force computation, exploiting the sparsity of the set
of pairs of points with large correlation. The resulting procedure is quite fast, requiring for
example 2.7s on the proteomics example (m = 3160, n = 20).

The proposal of this paper can be applied to outcome measures other than two-class
problems. We have seen this earlier in the lymphoma example, where the outcome was
survival time. Other response types that may arise include a multi-class or quantitative
outcome. The modification to the correlation-sharing technique is simple: the t-statistic (2)
is simply replaced by a score that is appropriate for the outcome measure. For survival data,
for example, we use the partial likelihood score statistic for each gene. This was illustrated
in the lymphoma data of Table 1.

Correlation-sharing provides a recipe for supervised clustering of features. Hence one 
might use correlation-sharing as a pre-processing step, by averaging the given features in the 
prescribed clusters. Then these averaged features could be used as input into a regression or 
classification procedure. This is a topic for future study. 
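
As an illustration of this pre-processing idea, the sketch below replaces each feature by the average of the features in its chosen correlation neighborhood. It assumes the maximizing correlations are already available as a vector `rho_hat` (one value per feature) from a prior correlation-sharing step; the function name `cluster_average_features` is hypothetical, and whether such averaged features help a downstream classifier is exactly the open question raised above.

```python
import numpy as np

def cluster_average_features(X, rho_hat):
    """Replace row i of X by the mean of all rows j with corr(x_i, x_j) >= rho_hat[i]."""
    X = np.asarray(X, dtype=float)
    R = np.clip(np.corrcoef(X), -1.0, 1.0)   # feature-feature correlations
    np.fill_diagonal(R, 1.0)                  # a feature always belongs to its own cluster
    Z = np.empty_like(X)
    for i in range(X.shape[0]):
        members = R[i] >= rho_hat[i]          # the adaptively chosen, possibly overlapping cluster
        Z[i] = X[members].mean(axis=0)
    return Z
```

The rows of the returned matrix can then be passed to a regression or classification routine in place of the original features.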
Name                 Description   # Samples   # Features   Source

Skin                 Two classes   58          12,625       ?
Duke breast cancer   Two classes   49          7097         ?
BRCA                 Two classes   15          3226         ?
Lymphoma             Survival      240         7399         ?

Table 1: Summary of datasets for Figure ?



Figure 1: T-statistics and correlation-shared T-statistics for simulated example.
[Panels: t-statistic; correlation-shared t-statistic]

Figure 2: Results for example 1. Left panel: Number of false positive genes versus number of
genes called significant. Right panel: Number of false negative genes versus number of genes called
significant.

Figure 3: Results for example 2. Here the non-null genes have no correlation before the group
effect is added.

Figure 4: Average absolute residual correlation as a function of the number of genes called
significant by the T or F-statistics.

Figure 5: Results for four cancer datasets: plotted is the number of false positive genes versus
the number of genes called, for the standard t-statistic (red) and the correlation-shared t-statistic
(green). The broken line is the 45° line. [Panels include: Skin data; Duke data]

Figure 6: Skin data: a closer look at gene 1127

Figure 7: Skin data: correlation-shared score versus number of genes used in each gene average;
horizontal lines are drawn at cutpoints that yield 100 significant genes. Note that most of the
significant genes use no averaging, and none use a window of more than 10 genes.

Figure 8: Skin data: correlation-shared score versus t-statistic score. Broken lines are drawn at
the cutoffs yielding 100 significant genes for each method. The red points are the genes that are
significant by correlation-sharing but not by t-statistic.

Figure 9: Skin data test set results. Here we formed cutoff rules on the training set, and assessed
genes in a separate test set. Shown are the number of false positive and negative genes in the
test set, as the cutpoint is varied.

Figure 10: Results for protein mass spectrometry example

Figure 11: Protein mass spectrometry example: locations of neighbors of top 50 peaks, for those
peaks that were given neighborhoods of more than a single feature. The maximizing correlations
are indicated at the top of the plot (note that in some cases there are multiple target peaks
shown near the same position).

Figure 12: The non-centrality parameter as a function of neighborhood size; least favorable
case. The top plot is the cumulative average $r^{-1} \sum_{j=1}^{r} \beta_j$ versus r. The bottom plot shows
the noncentrality parameter versus r. The horizontal line shows the noncentrality parameter
for $T_1$. For 1 < r < 80, the noncentrality parameter for $T_1$ is larger than the noncentrality
parameter for $u_1$. Since the top plot is maximized at r = 1 we expect that the correlation
neighborhood for $u_1$ should have r close to 1.

Figure 13: The non-centrality parameter as a function of neighborhood size; typical case.
The top plot is the cumulative average $r^{-1} \sum_{j=1}^{r} \beta_j$ versus r. The bottom plot shows the
noncentrality parameter versus r. The horizontal line near 0 shows the noncentrality parameter
for $T_1$. The horizontal line near 100 shows the noncentrality parameter for $u_1$ when the
correlation neighborhood is r = 20, corresponding to the maximum of the top plot. Not only is
there a large gain in noncentrality, but the gain is robust to fluctuations in r.